Abstract

Continuous availability of HPC systems built from commodity
components has become a primary concern as system size grows
to thousands of processors. In this paper, we present the
analysis of 8-24 months of real failure data collected from
three HPC systems at the National Center for Supercomputing
Applications (NCSA). The results show that the availability
is 98.7-99.8% and that most outages are due to software halts.
Downtime, on the other hand, is mostly contributed by hardware
halts or scheduled maintenance. We also used failure clustering
analysis to identify several correlated failures.



1. Introduction 

Continuous availability of high performance computing
(HPC) systems built from commodity components has become
a primary concern as system size grows to thousands of
processors. To design more reliable systems, a solid
understanding of the failure behavior of current systems is needed.
We believe failure data analysis of HPC systems
can serve three purposes. First, it highlights dependability
bottlenecks and serves as a guideline for designing more
reliable systems. Second, real data can be used to drive
numerical evaluation of performability models and simulations,
which are an essential part of reliability engineering.
Third, it can be applied to predict node availability, which
is useful for resource characterization and scheduling [1].

In this paper, we study 8-24 months of real failure data
collected from three HPC systems¹ at the National Center
for Supercomputing Applications (NCSA). The remainder
of this paper is organized as follows. In Section 2 we describe
the system characteristics and failure data collection. We
present a preliminary analysis of the failure data in Section 3,
followed by failure distribution and correlation analysis in
Sections 4 and 5. We summarize related work in Section 6 and
conclude our study in Section 7.

¹ All NCSA HPC systems described in this paper were
decommissioned in 2003 and 2004.

2. The Systems and Measurements 

The three HPC systems we studied are quite different
architecturally. The first is an array of SGI Origin 2000 (O2K)
machines. The SGI Origin 2000 is a cc-NUMA distributed
shared memory supercomputer. An O2K can have up to 512
CPUs and 1 TB of memory, all under the control of one
single-system-image IRIX operating system. The configuration at
NCSA is an array of twelve O2Ks (1520 CPUs in total)
connected by proprietary, high-speed HIPPI switches. Table 1
lists its detailed specification. Machines A, B, E, F, and
N are equipped with 250 MHz MIPS R10000 processors,
and the rest with 195 MHz MIPS R10000 processors. M4
accepts interactive access, while the other machines only
service batch jobs. Peak performance of the NCSA O2K is 328
gigaflops.

The second and third systems are Beowulf-style PC
clusters. The "Platinum" cluster has 520 two-way SMP 1 GHz
Pentium-III nodes (1040 CPUs), 512 of which are compute
nodes (2 GB memory); the rest are storage nodes and
interactive access nodes (1.5 GB memory). The "Titan" cluster
consists of 162 two-way SMP 800 MHz Itanium-1 nodes
(324 CPUs), 160 of which are compute nodes (1.5 GB
memory) and 2 of which are for interactive access. Both clusters
use Myrinet 2000 and Gigabit Ethernet as system interconnects.
Myrinet is faster and carries inter-node communication, whereas
Gigabit Ethernet is slower and serves I/O traffic. Both
clusters have one teraflop of peak performance.

All three HPC systems use batch job control software to
manage workload. O2K runs the LSF (Load Sharing Facility)
queueing system. Each job on O2K has resource limits of
50 hours of run time and 256 CPUs. Platinum and Titan
employ the Portable Batch System with the Maui Scheduler,
and the job limits are 352 and 128 nodes for 24 hours,
respectively.

According to a user survey [8], the NCSA HPC systems
are devoted to research in multiple scientific disciplines:
physics (20%), engineering (16%), chemistry (14%), biology
(13%), astronomy (13%), and material science (12%).
Seventy percent of users write programs in Fortran (F90 and
F77) or a mix of Fortran and C/C++. Sixty-five percent of users
use MPI or OpenMP as the parallel programming model. In
terms of job sizes, 22% of users typically allocate 9-16 CPUs.
About equally many users (14-15%) allocate 2-4, 5-8, 17-32,
or 33-64 CPUs.

The failure log was collected in the form of monthly
or quarterly reliability reports. At the end of a month or
a quarter, a report for each node/machine is created. A
report records the outage date (but not the outage time), type,
and duration. There are five outage types defined by the NCSA
system administrators: Software Halt (SW), Hardware Halt
(HW), Scheduled Maintenance (M), Network Outages, and
Air Conditioning or Power Halts (PWR). The cause of an
outage is determined as follows: a program that runs at machine
boot time prompts the administrator to enter the reason for
the outage. If nothing is entered within two minutes, the
program defaults to recording a Software Halt.

The data collection period was two years (April 2000 to
March 2002) for O2K and eight months (January 2003 to
August 2003) for Platinum and Titan. This set of failure
logs contains no occurrence of a Network Outage, so we
exclude that type from the rest of the analysis.

3. Preliminary Results 

Before describing the failure data, we clarify
some terminology. Time to Failure (TTF) is the interval
between the end of the last failure and the beginning of the next
failure. Time between Failures (TBF) is the interval
between the beginnings of two consecutive failures. Time to
Repair (TTR) is synonymous with Downtime. Figure 1
illustrates the differences. Because the failure log does not
include the start and end times of outages, we can only
calculate TBFs in terms of days.

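[Figure 1: timeline with two consecutive failures. Each failure
starts a Downtime period (TTR); the Uptime between the end of
one failure and the start of the next is the TTF; the TBF spans
from the start of one failure to the start of the next.]

Figure 1. TBF, TTF, and TTR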
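The three definitions can be made concrete with a minimal sketch. It assumes hypothetical outage records that carry start and end times in hours, which the NCSA log does not provide; the function name and the sample data are illustrative only.

```python
def failure_intervals(outages):
    """Return (ttr, ttf, tbf) lists for chronologically sorted outages.

    Each outage is a (start, end) pair, in hours since the start of the log.
    """
    # TTR: duration of each outage (downtime).
    ttr = [end - start for start, end in outages]
    pairs = list(zip(outages, outages[1:]))
    # TTF: end of one outage to the beginning of the next (uptime).
    ttf = [nxt[0] - prev[1] for prev, nxt in pairs]
    # TBF: beginning of one outage to the beginning of the next.
    tbf = [nxt[0] - prev[0] for prev, nxt in pairs]
    return ttr, ttf, tbf


# Made-up example data, not taken from the NCSA log.
sample = [(0, 2), (10, 11), (30, 36)]
ttr, ttf, tbf = failure_intervals(sample)
print(ttr, ttf, tbf)
```

Each TBF equals the TTR of the earlier outage plus the TTF that follows it, which is the relationship Figure 1 depicts.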

Tables 1 and 2 and Figure 2 summarize the failure data
for the three HPC systems. There are two kinds of availability
measures. The usual availability is computed as

\[ 1 - \frac{\sum (\#\,\text{Down CPU} \times \text{Downtime})}{\#\,\text{Total CPU} \times \text{Total time}} \]

The scheduled availability (S Avail) removes the Scheduled
Maintenance downtime from consideration and only counts
scheduled uptime as total time, so it is computed as

\[ 1 - \frac{\sum (\#\,\text{Down CPU} \times \text{Unsched. Downtime})}{\#\,\text{Total CPU} \times \text{Sched. time}} \]

Note that in O2K's case, the twelve machines have different
numbers of CPUs, so "# Down CPU" is the number of CPUs
on the failed machine. In Platinum's and Titan's case,
"# Down CPU" is 2.
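The two availability measures can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the record layout (down CPUs, downtime in hours, scheduled flag) is hypothetical, and the denominator of the scheduled availability is interpreted here as total CPU-hours minus the CPU-hours spent in scheduled maintenance.

```python
def availability(outages, total_cpus, total_hours):
    """Usual availability: 1 - lost CPU-hours / total CPU-hours.

    outages: iterable of (down_cpus, downtime_hours, is_scheduled).
    """
    lost = sum(cpus * hours for cpus, hours, _ in outages)
    return 1.0 - lost / (total_cpus * total_hours)


def scheduled_availability(outages, total_cpus, total_hours):
    """Scheduled availability: ignores scheduled maintenance downtime.

    Assumption: scheduled time is the total CPU-hours minus the CPU-hours
    taken by scheduled maintenance.
    """
    sched_lost = sum(c * h for c, h, scheduled in outages if scheduled)
    unsched_lost = sum(c * h for c, h, scheduled in outages if not scheduled)
    return 1.0 - unsched_lost / (total_cpus * total_hours - sched_lost)


# Made-up example: a 4-CPU system observed for 100 hours.
records = [
    (2, 5.0, False),   # one 2-CPU node down for 5 hours, unscheduled
    (4, 10.0, True),   # whole system down for 10 hours of maintenance
]
print(availability(records, 4, 100.0))
print(scheduled_availability(records, 4, 100.0))
```

For a cluster of two-way nodes, `down_cpus` would be 2 for every record; for O2K it would be the CPU count of the failed machine.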

For the whole system of O2K, the TBF reported in
Table 1 is actually TBF, and the downtime is the weighted
average of individual machine downtimes:

\[ \frac{\sum (\#\,\text{Down CPU} \times \text{Downtime})}{\#\,\text{Total CPU}} \]

The data show that software halts account
for most outages (59-83%), but their average downtime (i.e.
MTTR) is only 0.6-1.5 hours. On the other hand, although
the fraction of hardware outages is small (1-13%), the average
hardware downtime is the greatest among all unscheduled
outage types (6.3-100.7 hours). This is reasonable
because hardware problems usually require ordering and
replacing parts and performing tests, while many software
problems can be fixed by a reboot.

We contacted the NCSA staff about the causes of hardware
failures on the PC clusters. We were told that there were
two or three cases where power supplies needed to be
replaced; otherwise, the main cause of hardware outages was
Myrinet, including network cards, cables, and switch cards.
A network card resides in a host PC and is connected by
cables to the Myrinet switch enclosure. A Myrinet switch
enclosure stacks many Myrinet M3-SPINE switch cards. The
usual symptom that prompts a network card or switch card
replacement is an excessive number of CRC errors.
Sometimes the self-test in a switch card fails and leads to
replacement. Cable replacements also occurred because
"ping" query packets could not get through.

The availability is lower for O2K because when one of its
machines is down, as much as one-sixth of the overall system
capacity can disappear (e.g. machine B, which has
256 CPUs). This is unlike PC clusters, in which each node
usually contains no more than 8 CPUs, so availability
degrades more gracefully, assuming the outage is not
catastrophic, such as a power failure or network partitioning.
Although monolithic single-system-image machines
benefit from ease of administration, a unified view of the
process space, and extremely fast interprocess communication,
it seems that large systems composed of finer-grained
management units are more favorable in terms of availability.

For O2K, the machine-wise TBFs and TTRs are skewed
toward small values. Eleven of the twelve machines have an
MTBF greater than 8 days, but the medians of TBF are
mostly smaller than 4 days. For TTR, nine machines have an
MTTR greater than 2.5 hours, yet the medians are 0.3-0.9
hours. The same phenomenon also occurs in Platinum's
and Titan's node TTR. These observations prompt us to examine
closely the distributions of TBF and TTR; we document
our findings in the next section.

Table 1. O2K Failure Data Summary

Table 2. Platinum and Titan Failure Data Summary

Figure 2. The rows from top to bottom depict weekly Availability,
Outages, Downtime, and Failure Clustering (see Section 5),
respectively. The X axis in all plots is week. The Y axis in the
Downtime row is CPU-hours and in the Failure Clustering row,
the number of machines/nodes involved.

4. Failure Distribution 

In analytical modeling, the distributions of TBF and TTR
are key components for obtaining precise results [12]
because distributions of the same mean and variance can still
yield very different outcomes. In this section, we investigate
the distributions of TBF and TTR under the assumption
that failures and repairs are all independent.

We first choose a set of distributions as our parametric
probability models and seek the parameters that best fit the
data to these models. An open-source statistical package
called WAFO [2] is used to find the parameters. Then we
apply the chi-square test as a goodness-of-fit test to pick the
best-fit distribution.
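The fit-then-test procedure can be illustrated with a minimal standard-library sketch. It does not use WAFO: it fits only an exponential model by maximum likelihood and computes the Pearson chi-square statistic over caller-chosen bins. The function names, the bin edges, and the sample values are assumptions made for illustration.

```python
import math


def fit_exponential(samples):
    """Maximum-likelihood rate of an exponential model: 1 / sample mean."""
    return len(samples) / sum(samples)


def chi_square_statistic(samples, cdf, edges):
    """Pearson chi-square statistic of binned samples against a model CDF.

    edges are increasing bin boundaries starting at the lower end of the
    support; the last bin is open-ended. Every bin is assumed to have a
    non-zero expected count under the model.
    """
    n = len(samples)
    bounds = list(edges) + [math.inf]
    stat = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        observed = sum(1 for x in samples if lo <= x < hi)
        upper = 1.0 if hi == math.inf else cdf(hi)
        expected = n * (upper - cdf(lo))
        stat += (observed - expected) ** 2 / expected
    return stat


# Made-up TBF samples in days.
tbf_days = [0.5, 1.0, 1.5, 2.0, 5.0]
rate = fit_exponential(tbf_days)
stat = chi_square_statistic(
    tbf_days, lambda x: 1.0 - math.exp(-rate * x), edges=[0.0, 1.0, 3.0]
)
print(rate, stat)
```

Repeating the statistic for each candidate model over the same bins and keeping the smallest value mirrors the selection step described above; a full test would also compare the statistic against the chi-square distribution with the appropriate degrees of freedom.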

Our selection of probability models includes the exponential
and gamma distributions and a family of heavy-tail
distributions (Weibull, Truncated Weibull, Log-normal, Inverse
normal, and Pareto [4]). Heavy-tail means that the complementary
cumulative distribution function 1 - F(x) decays more
slowly than exponentially. Heavy-tail distributions are chosen
because many failure data studies (e.g. [10, 3]) have
shown that they are actually more prevalent than the exponential
distribution, which is commonly assumed in probability
models to make analysis tractable.
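To make the heavy-tail notion concrete, the sketch below compares the complementary CDF 1 - F(x) of an exponential distribution with that of a Weibull distribution whose shape parameter is below 1. The parameters are hypothetical and chosen so that both distributions have mean 2; they are not the fitted values from this study.

```python
import math


def ccdf_exponential(x, rate):
    """1 - F(x) for an exponential distribution."""
    return math.exp(-rate * x)


def ccdf_weibull(x, scale, shape):
    """1 - F(x) for a Weibull distribution."""
    return math.exp(-((x / scale) ** shape))


# Hypothetical parameters: Weibull(scale=1, shape=0.5) has mean
# scale * Gamma(1 + 1/shape) = 2, matching an exponential with rate 0.5.
for x in (10.0, 20.0, 40.0):
    ratio = ccdf_weibull(x, 1.0, 0.5) / ccdf_exponential(x, 0.5)
    print(x, ratio)
```

The ratio of the two tails keeps growing with x, which is what "decays more slowly than exponentially" means in practice: long intervals are far more likely under the heavy-tail model than an exponential with the same mean would suggest.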

For each system, we pool the TBF and TTR data of
all machines/nodes and present their distributions and fitting
functions in Figure 3. For O2K, the TTR is fit by Inverse
normal $f(x) = 1.87\,(2\pi x^{3})^{-0.5} \exp(-12.76\,(x-0.37)^{2}/x)$
and the TBF by Weibull $F(x) = 1 - \exp(-5.61\,x^{[\text{illegible}]})$. For
Platinum, the TTR is fit by Truncated Weibull
$F(x) = 1 - \exp(-6.79\,(x+0.14)^{0.15} + 5.07)$ and the TBF by
Exponential $F(x) = 1 - e^{-[\text{illegible}]\,x}$. For Titan, the TTR is fit by
Gamma $f(x) = 0.27\,x^{[\text{illegible}]} e^{-[\text{illegible}]\,x}$ and the TBF by
Exponential $F(x) = 1 - e^{-[\text{illegible}]\,x}$.
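As a consistency check on the O2K TTR fit, the inverse normal (inverse Gaussian) density is commonly written with a mean mu and a shape lambda. The sketch below assumes, and this is only an assumption, that the fit above uses this parameterization with mu = 0.37 hours and sqrt(lambda) = 1.87, and then recomputes the coefficient in the exponent.

```python
import math


def inverse_gaussian_pdf(x, mu, lam):
    """Inverse Gaussian density for x > 0 with mean mu and shape lam."""
    return math.sqrt(lam / (2 * math.pi * x ** 3)) * math.exp(
        -lam * (x - mu) ** 2 / (2 * mu ** 2 * x)
    )


mu = 0.37         # assumed mean in hours, read from the (x - 0.37) term
lam = 1.87 ** 2   # assumed: the leading coefficient 1.87 is sqrt(lambda)

# Coefficient multiplying (x - mu)^2 / x inside the exponential.
exponent_coeff = lam / (2 * mu ** 2)
print(exponent_coeff)
print(inverse_gaussian_pdf(0.5, mu, lam))
```

Under these assumptions the recomputed coefficient comes out close to the 12.76 quoted for the O2K TTR fit, so the three printed constants are mutually consistent up to rounding.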

The distributions of Titan's failure data have staircase-like
shapes not found in the other two systems. For example,
there are two sudden jumps at 1.4 hours and 6.8 hours
in Titan's TTR distribution. The jumps mean that a large
number of nodes were down for about the same period of time,
which suggests a possibility of correlated failure. To
understand this anomaly, we perform a failure correlation
analysis, as described in the next section.

5. Failure Correlation 

In the last section we assumed that failures are independent
and derived the failure distributions. Failure independence
is a common assumption in reliability engineering to
simplify analysis and system design. However, many statistical
tests and log analyses have shown that real-world distributed
computing environments do exhibit correlated failures.

[Figure 3: six distribution plots with panels O2K TBF, O2K TTR,
Platinum TBF, Platinum TTR, Titan TBF, and Titan TTR.]

Figure 3. Distributions of node TBF and TTR.
Dashed line is the fitting distribution. The X
axis in TBF plots is day and in TTR plots, hour.

In this section, we investigate how outages of different
machines relate to each other using a clustering approach [13].
Roughly speaking, this approach groups failures that are
close either in space or in time. It should be emphasized
that the correlation resulting from clustering is purely statistical
and does not imply that the failures have a cause-and-effect
(causal) relationship. Since our collection of failure
logs lacks error details, we can only rely on statistics to find
correlation.

To avoid confusion with the word "cluster" in "PC clusters,"
we will refer to a failure cluster as a "batch." We define a
batch to be a time period [T1, T2] in which there is at least
one outage (regardless of type) on every day, and no outages
occur on day T1 - 1 or T2 + 1. Put another way, we
coalesce into a batch the failures of different machines/nodes
that occur on consecutive days. The bottom row of Figure 2
illustrates the results. The width and height of a rectangle
indicate the duration and the machine/node count of that
batch, respectively.
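The batch definition above translates directly into a small grouping routine. The sketch below is one possible reading, assuming hypothetical log records of the form (outage date, node name); it returns each batch as its first day, last day, and the set of nodes involved.

```python
from datetime import date, timedelta


def find_batches(outages):
    """Group outages on consecutive days into batches.

    outages: iterable of (day, node) pairs, where day is a datetime.date.
    Returns a list of (first_day, last_day, nodes) tuples in date order.
    """
    by_day = {}
    for day, node in outages:
        by_day.setdefault(day, set()).add(node)

    batches = []
    for day in sorted(by_day):
        if batches and day - batches[-1][1] == timedelta(days=1):
            # The previous batch ended yesterday: extend it.
            first, _, nodes = batches[-1]
            batches[-1] = (first, day, nodes | by_day[day])
        else:
            # A day without outages separates this day from the last batch.
            batches.append((day, day, set(by_day[day])))
    return batches


# Made-up log entries.
log = [
    (date(2003, 1, 1), "node-a"),
    (date(2003, 1, 2), "node-b"),
    (date(2003, 1, 5), "node-a"),
]
for first, last, nodes in find_batches(log):
    print(first, last, sorted(nodes))
```

The batch width in the plot corresponds to `(last - first).days + 1` and the height to `len(nodes)`. Whether isolated single-node days should count as batches is a choice this sketch leaves to the caller, for example by filtering on the node count.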

Using this method, we found 79 batches for
O2K, accounting for 55 percent of all outages. Eighty-five
percent of batches last no more than three days, and
89 percent of batches involve no more than four machines.
There are four batches that involve all twelve machines. In
week 31, the failure was caused by a power or air conditioning
problem and was followed by a two-day maintenance. In
week 35, there was a system-wide maintenance on the first
day; some machines then experienced hardware halts and all
were again taken offline for maintenance on the second day,
and all machines had short software problems on the last
day. In week 78, a system maintenance occurred and lasted
37-91 hours. The last catastrophic outage occurred in week
97 due to power problems. Note that the massive outages in
weeks 31, 35, and 78 are also reflected as spikes in the
Availability and Downtime plots.

The failure clustering plot also reveals some possible
failure correlation in the Platinum and Titan systems.
Statistically speaking, the chance of a batch containing a great
number of outages in a short time (e.g. the razor-thin rectangles
in the bottom row of Figure 2) is close to zero. Thus, a
reasonable explanation for such an occurrence is failure
correlation. To justify this claim, we take the Platinum system as
an example. There is a batch in week 4 which contains
501 nodes in one day. If we assume that failures are independent
and that TBF has an exponential distribution, then the number
of failures in a given duration follows a Poisson distribution.
So the chance of at least 501 outages in one day is
$\sum_{n=501}^{\infty} e^{-30}\,30^{n}/n! = 6.3 \times 10^{-[\text{illegible}]}$,
where 30 is the average number of outages per day of the
Platinum system. The log shows that this particular outage
was a Software Halt with 5-15 minutes of downtime.
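A Poisson tail this far from the mean is too small to represent as an ordinary double-precision number, so a direct summation underflows to zero. The sketch below, which takes the stated mean of 30 outages per day at face value, evaluates the base-10 logarithm of the tail probability in log space instead. The function name and the truncation length are illustrative choices.

```python
import math


def log10_poisson_tail(k, rate, terms=2000):
    """log10 of P(N >= k) for N ~ Poisson(rate).

    Sums the first `terms` terms of the tail in log space to avoid
    underflow; the truncation is safe when the terms beyond k + terms
    are negligible, as they are when k is well above the rate.
    """
    # Natural log of each pmf term: -rate + n*ln(rate) - ln(n!).
    logs = [
        -rate + n * math.log(rate) - math.lgamma(n + 1)
        for n in range(k, k + terms)
    ]
    peak = max(logs)
    # log-sum-exp: factor out the largest term before exponentiating.
    total = peak + math.log(sum(math.exp(v - peak) for v in logs))
    return total / math.log(10)


print(log10_poisson_tail(501, 30.0))
```

The result is a large negative exponent, which supports the argument that hundreds of independent outages on a single day are effectively impossible and that the week 4 batch reflects a correlated failure.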

The Titan system's failure correlation is even more conspicuous.
The three peaks represent massive outages at week
10, due to a 9-minute software halt followed by 6.8 hours of
hardware halt; at week 14, due to 1.4 hours of power failure;
and at week 21, due to 6.8 hours of power failure. The 1.4
and 6.8 hours of downtime explain the two sudden rises in
Titan's TTR distribution in Figure 3, as most nodes
experienced them. The three staircases in Titan's TBF distribution
reflect the intervals among the three massive outages, which
are 64, 29, and 48 days. As in O2K's case, the three
outages of Titan are also mirrored in the Availability, Outages,
and Downtime plots.

6. Related Work 

Field failure data analysis of very large HPC systems 
is usually for internal circulation and is almost never pub- 
lished in detail. Nevertheless, there are several talks and 
reports that shed light on the administration experience of 
some of the world's most powerful supercomputers. 

Koch reported the situation of ASCI White. A
whole-system reboot of ASCI White takes 4 hours, and
preventive maintenance is performed weekly, with separate
periods for software and hardware. Machine problems
occurred in every aspect of the system. Transient CPU faults
generated invalid floating-point numbers, and it took great
effort to spot these corrupted nodes because they passed
standard diagnostic tests and only failed in real programs.
Bad optical interconnects led to non-repeatable link errors
which corrupted the computation because these errors could
sneak through network host firmware without being
detected. The storage system was not 100% dependable
either. The parallel file system sometimes failed to return an I/O
error to the user program when the program was dumping
restart files. In addition, the archival subsystem's buggy
firmware corrupted restart files and made the user program
fail to launch.

Seager [11] showed that the reliability of ASCI
White improved over time as MTBF increased steadily from
as short as 5 hours in January 2001 to 40 hours in February
2003. Apart from uncategorized failures, the storage system
(both local disks and the IBM Serial Disk System) is the
main source of hardware problems. Next to disks are CPU
and third-party hardware troubles. For software, communication
libraries and operating systems contributed the most
interruptions.

Morrison [9] reported on operations of the ASCI Q from
June 2002 through February 2003. The MTBI (mean time
between interruptions) is 6.5 hours, and on average there
were 114 unplanned outages per month. Putting the storage
subsystem aside, hardware problems account for 73.6% of
node outages, with CPU and memory modules being
responsible for over 96% of all hardware faults (CPU is 62.5%
and memory is 33.6%). Network adaptors and system boards
seldom fail.

Levine [7] described the failure statistics of the Pittsburgh
Supercomputing Center's supercomputer Lemieux: the MTBI
from April 2002 to February 2003 is 9.7 hours, shorter
than the predicted 12 hours. The availability is 98.33% from
mid-November 2002 to early February 2003.

The National Energy Research Scientific Computing
Center (NERSC) houses several supercomputers, and their
operations are summarized in NERSC's annual
self-evaluation reports [6]. From August 2002 to July 2003,
their largest supercomputer, Seaborg, reached 98.74% scheduled
availability, 14 days MTBI, and 3.3 hours MTTR. Storage
and file servers had similar availability. Two-thirds of
Seaborg's outages and over 85% of the storage system's
outages are due to software.

7. Conclusions 

In this paper we reported the failure data analysis of three
NCSA HPC systems, one of which is an array of distributed
shared memory mainframes and the rest of which are PC clusters.
The results show that the availability is 98.7-99.8%. Most
outages are due to software halts, but the downtime per outage
is highest for hardware halts and scheduled maintenance.
We also sought the distributions of time between
failures and time to repair and found that some of them exhibit
heavy-tail distributions instead of exponential ones. Finally, we
applied failure clustering analysis and identified several
correlated failures. Because failure data analysis of HPC
systems is scarce, we believe this paper provides valuable
information for researchers and practitioners working on
reliability modeling and engineering.